An auditory-based measure for improved phone segment concatenation
Abstract
This paper describes a new auditory-based distance measure intended for use in a concatenated synthesis technique wherein time- and frequency-domain characteristics are used to perform natural-sounding speaker synthesis. Whereas most concatenation systems use large databases (often more than 100,000 units), we begin from a small, limited database (approximately 400 units) and use a new spectral distortion measure to aid in the selection of phones for optimal concatenation. At the transition between speech segments, the new auditory-based distance metric assesses perceived discontinuities in the frequency domain. The distortion measure, which employs the Carney auditory model, is used to select phones that minimize the perceived distortion between concatenated segments. Moreover, time- and frequency-domain methods can shape the prosodic and spectral characteristics of each speech segment. The final results demonstrate improved performance over standard concatenation methods applied to small databases.
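As a rough illustration of selecting a phone by minimizing a spectral distance at the join, the following minimal sketch may help. It does not reproduce the Carney auditory model: a plain log-magnitude spectrum is swapped in as a stand-in representation, and the function names (`log_spectrum`, `join_distance`, `select_unit`), the frame length, and the FFT size are all hypothetical choices made for this example, not details from the paper.

```python
import numpy as np

def log_spectrum(frame, n_fft=256):
    """Log-magnitude spectrum of one Hann-windowed frame.

    Assumption: this is only a stand-in for the output of an auditory model.
    """
    windowed = frame * np.hanning(len(frame))
    mag = np.abs(np.fft.rfft(windowed, n=n_fft))
    return np.log(mag + 1e-10)  # small offset avoids log(0)

def join_distance(left_unit, right_unit, frame_len=128):
    """RMS log-spectral distance between the last frame of left_unit
    and the first frame of right_unit."""
    a = log_spectrum(left_unit[-frame_len:])
    b = log_spectrum(right_unit[:frame_len])
    return float(np.sqrt(np.mean((a - b) ** 2)))

def select_unit(left_unit, candidates, frame_len=128):
    """Return the index of the candidate with the smallest join distance,
    together with the list of all distances."""
    dists = [join_distance(left_unit, c, frame_len) for c in candidates]
    return int(np.argmin(dists)), dists

# Toy usage with synthetic tones sampled at 8 kHz (illustrative data only).
fs = 8000.0
t = np.arange(408) / fs
left = np.sin(2 * np.pi * 200 * t)         # segment already chosen
candidates = [
    np.sin(2 * np.pi * 1000 * t),          # spectrally mismatched candidate
    np.sin(2 * np.pi * 200 * t),           # spectrally matched candidate
]
best, dists = select_unit(left, candidates)
print(best, dists)
```

In this toy case the matched 200 Hz candidate is selected because its boundary frame has the smaller spectral distance to the end of the preceding segment. A perceptually motivated measure would replace `log_spectrum` with an auditory-model representation while keeping the same selection loop.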
Similar resources
Organizing phone models based on piecewise linear segment lattices of speech samples
Aiming at robust speech recognition, we have proposed a framework for “phonological concept formation,” which is the task of acquiring an efficient representation of phonemes from spoken word samples without using any transcriptions except for the lexical classification of the words. In order to implement this task, we propose the “piecewise linear segment lattice (PLSL)” model for phoneme repr...
Generalized phone modeling based on piecewise linear segment lattice
The goal of this work is to model phone-like units automatically from spoken word samples without using any transcriptions except for the lexical identification of the words. In order to implement this task, we have proposed the “piecewise linear segment lattice (PLSL)” model for phoneme representation. The structure of this model is a lattice of segments, each of which is represented as regress...
A comparison of spectral smoothing methods for segment concatenation based speech synthesis
There are many scenarios in both speech synthesis and coding in which adjacent time-frames of speech are spectrally discontinuous. This paper addresses the topic of improving concatenative speech synthesis with a limited database by proposing methods to smooth, adjust, or interpolate the spectral transitions between speech segments. The objective is to produce natural-sounding speech via segmen...
High-Quality and Flexible Speech Synthesis with Segment Selection and Voice Conversion
Text-to-Speech (TTS) is a useful technology that converts any text into a speech signal. It can be utilized for various purposes, e.g. car navigation, announcements in railway stations, response services in telecommunications, and e-mail reading. Corpus-based TTS makes it possible to dramatically improve the naturalness of synthetic speech compared with the early TTS. However, no general-purpos...
Automatically Creating a Diphone Set from a Speech Database
This paper presents a measure that scores various aspects of phone quality. The measure is designed to penalize phone instances with one or several characteristics that are not desirable in concatenation-based speech synthesis. Depending on the phone type, these aspects amongst others include spectrum, phase, fundamental frequency, duration, voicing and plosive quality. We applied this quality ...
Publication date: 1997